Abstract

The diversity and abundance of non-long terminal repeat (LTR) retrotransposons (nLTR-RT) differ drastically among vertebrate 
genomes. At one extreme, the genome of placental mammals is littered with hundreds of thousands of copies resulting from 
the activity of a single clade of nLTR-RT, the L1 clade. In contrast, fish genomes contain a much more diverse repertoire of 
nLTR-RT, represented by numerous active clades and families. Yet, the number of nLTR-RT copies in teleostean fish is two 
orders of magnitude smaller than in mammals. The vast majority of insertions appear to be very recent, suggesting that nLTR- 
RT do not accumulate in fish genomes. This pattern had previously been explained by a high rate of turnover, in which the 
insertion of new elements is offset by the selective loss of deleterious inserts. The turnover model was proposed because of 
the similarity between fish and Drosophila genomes with regard to their nLTR-RT profile. However, it is unclear if this model 
applies to fish. In fact, a previous study performed on the puffer fish suggested that transposable element insertions behave 
as neutral alleles. Here we examined the dynamics of amplification of nLTR-RT in the three-spine stickleback (Gasterosteus 
aculeatus). In this species, the vast majority of nLTR-RT insertions are relatively young, as suggested by their low level of 
divergence. Contrary to expectations, a majority of these insertions are fixed in lake and oceanic populations; thus, nLTR-RT 
do indeed accumulate in the genome of their fish host. This is not to say that nLTR-RTs are fully neutral, as the lack of fixed 
long elements in this genome suggests a deleterious effect related to their length. This analysis does not support the turnover 
model and strongly suggests that a much higher rate of DNA loss in fish than in mammals is responsible for the relatively 
small number of nLTR-RT copies and for the scarcity of ancient elements in fish genomes. We further demonstrate that 
nLTR-RT decay in fish occurs mostly through large deletions and not by the accumulation of small deletions. 
Introduction

Non-long terminal repeat (LTR) retrotransposons (nLTR-RT) 
are mobile elements in the genome that replicate using an 
RNA intermediate and lack LTRs. They have considerably af- 
fected the size, structure, and function of vertebrate ge- 
nomes. In fact, the abundance of nLTR-RT is one of the 
major determinants of genome size differences among ver- 
tebrates. The impact nLTR-RTs have on their host is directly 
related to their diversity and abundance, which differ con- 
siderably among vertebrate groups. In mammals, nLTR-RTs 
are extremely abundant and account for as much as 30% of 



genome size (Lander et al. 2001; Waterston et al. 2002). 
Mammalian genomes are dominated by a single clade of 
nLTR-RT called L1 (Furano 2000). L1 has been amplifying
since the origin of the eutherian radiation and has accumulated
to considerable numbers, accounting for the large genome
size of mammals (2.0-3.6 Gb). In stark contrast, the
genomes of teleostean fish and squamate reptiles tend to be 
small and to contain an extraordinary diversity of active 
nLTR-RT, generally representing multiple clades (Volff 
et al. 2003; Duvernell et al. 2004; Furano et al. 2004; Novick 
et al. 2009). These clades are generally represented by mul- 
tiple and distinct groups of sequences, called families, that 
are concurrently active. Families of elements are usually represented
by small numbers (10 to a few hundred) of very
similar copies, suggesting that the majority of insertions are 
recent and do not accumulate in the genome of the host 
(Duvernell et al. 2004; Furano et al. 2004). The young 
age and small copy number of nLTR-RT in fish is suggestive 
of a rapid turnover of elements, in which the insertion of 
new elements is offset by the selective loss of element-con- 
taining loci. However, the turnover model has not been rig- 
orously tested in fish and was proposed because of the 
similarity between fish and Drosophila with regard to their 
nLTR-RT profile (Duvernell et al. 2004; Furano et al. 2004). In 
fact, the only population study done on a fish, the puffer 
fish, found a high number of fixed and high frequency in- 
sertions, suggesting that nLTR-RT are neutral, at least in this 
fish species (Neafsey et al. 2004). 

Teleostean fish constitute the most diverse vertebrate 
group, and this diversity is also reflected in the diversity 
of their genome size and structure (Volff 2005). A bioinfor- 
matic exploration of teleostean genomes has revealed 
considerable differences in the diversity and abundance 
of nLTR-RT among species (Basta et al. 2007). The factors 
responsible for these differences are not well understood. 
The copy number and family diversity in a given genome 
result from the interactions between the rate of transposi- 
tion, the control of transposition by the host, competition 
between families of elements for host-encoded resources, 
the intensity of selection against new inserts, and the demo- 
graphic history of populations. How these different factors 
interact remains unclear because empirical studies in natural 
populations are limited to a very small number of taxa and 
comparative studies are lacking. Here we present a detailed 
analysis of nLTR-RT in the three-spine stickleback 
(Gasterosteus aculeatus). 

Gasterosteus aculeatus is a small teleostean fish that has 
become one of the premier animal models in evolutionary 
biology. It is found in the coastal waters of the northern
Atlantic and Pacific Oceans. It is originally an oceanic 
species, but it has colonized innumerable freshwater habi- 
tats where it has undergone an extremely rapid adaptive 
radiation resulting in morphologically diverse populations 
(Bell and Foster 1994). A draft of the stickleback genome 
has been available since February 2006 on the University 
of California, Santa Cruz (UCSC) genome browser
(http://genome.ucsc.edu). The individual that was se- 
quenced comes from the Bear Paw Lake population in Alas- 
ka. It was chosen because of the low heterozygosity of this 
population due to isolation since the lake was colonized less 
than 14,000 years ago. We performed a bioinformatic analysis
of the stickleback genome to assess the diversity of
nLTR-RT in this species. We also determined the frequency 
of nLTR-RT in oceanic and lake populations, in particular 
from the population of origin of the sequenced genome. 
We found that short nLTR-RTs accumulate readily in the 



stickleback genome, whereas full-length copies appear to 
be under purifying selection. However, the near absence 
of ancient nLTR-RT copies suggests that a post-insertional 
mechanism is controlling nLTR-RT copy number in this 
species. We found that a much higher rate of DNA loss 
in fish than in mammals is responsible for the relatively small 
number of nLTR-RT copies and for the paucity of ancient 
elements in fish genomes. 

Materials and Methods 

Coordinates for all nLTR-RT elements were extracted from 
the February 2006 version of the stickleback genome 
(v1.0) using the RepeatMasker table available from the
UCSC genome browser (www.genome.ucsc.edu). Elements 
were then collected using the coordinates of the elements to 
which 500 bp of downstream and upstream sequences 
were added. In the case of the Maui elements, Repeat- 
Masker did not identify accurately the 5' end of the ele- 
ments; thus, 2 kb of upstream sequences were collected 
in this case. The length of each insertion as well as its start 
and end points were determined. 
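As an illustration of the flank-collection step, the following is a minimal sketch of how element coordinates could be extended by 500 bp on each side (or 2 kb upstream for Maui) before sequence retrieval. The `Interval` type, the `pad_interval` function, and the 0-based half-open coordinate convention are assumptions made for this example; they are not the scripts used in the study.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical container for one annotated element."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def pad_interval(iv, chrom_len, upstream=500, downstream=500, strand="+"):
    """Extend an element by flanking sequence, clamped to the chromosome ends."""
    if strand == "-":
        # On the minus strand, the element's upstream side lies at higher coordinates.
        upstream, downstream = downstream, upstream
    start = max(0, iv.start - upstream)
    end = min(chrom_len, iv.end + downstream)
    return Interval(iv.chrom, start, end)


# Default 500-bp flanks on each side of a 340-bp insertion.
padded = pad_interval(Interval("chrI", 1000, 1340), chrom_len=10_000)

# A Maui-like case: 2 kb collected upstream of a minus-strand element.
padded_maui = pad_interval(
    Interval("chrI", 3000, 3500), chrom_len=10_000, upstream=2000, strand="-"
)
```

Clamping to the chromosome boundaries avoids requesting coordinates that do not exist when an element lies close to the end of a scaffold.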

Within each clade, elements were aligned to each other 
using ClustalW in BioEdit (Hall 1999) to identify subsets of 
sequences that would represent distinct families. To this 
end, only elements at least 300 bp in length were included. 
Once the elements were aligned, a phylogenetic analysis 
using the neighbor joining and maximum likelihood meth- 
ods implemented in MEGA5.0 was performed. Groups of 
sequences that were well supported by a bootstrap procedure
(1,000 iterations; at least 80% bootstrap support) were
considered valid families. A consensus sequence was deter- 
mined for each family. Each family was characterized by its 
copy number (using a 100-bp cutoff) and its divergence 
used as a proxy of its age. Within-family divergences were 
estimated using the mean pairwise divergence between 
members of the families or the mean divergence between 
each member and the family consensus. Divergences and 
their standard deviation were calculated using MEGA5.0. 
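The two divergence measures described above can be sketched as follows. This example assumes an uncorrected proportion of differing sites (p-distance) computed on pre-aligned sequences, with gap and N positions skipped; the distances computed in MEGA5.0 may have used a substitution model, so the function names and the distance definition here are illustrative assumptions only.

```python
from itertools import combinations
from statistics import mean


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gaps and Ns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


def mean_pairwise_divergence(seqs):
    """Mean p-distance over all pairs of family members."""
    return mean(p_distance(a, b) for a, b in combinations(seqs, 2))


def mean_divergence_from_consensus(seqs, consensus):
    """Mean p-distance between each family member and the family consensus."""
    return mean(p_distance(s, consensus) for s in seqs)


# Toy alignment of three 10-bp copies; one copy differs at a single site.
family = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGTAC"]
pairwise = mean_pairwise_divergence(family)
to_consensus = mean_divergence_from_consensus(family, "ACGTACGTAC")
```

With recently amplified families the two measures behave differently: the pairwise value is roughly twice the divergence from consensus, because each pair accumulates changes along two lineages.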

Consensus sequences were aligned to each other. The 
National Center for Biotechnology Information ORF-Finder 
and Conserved Domains tools were used to identify the 
reverse transcriptase (RTase) domain, which was translated 
into amino acid by ORF-Finder. The RTase domains were 
then aligned with the RTase domains of other nLTR-RT 
representative of the major clades of nLTR-RT. Phylogenies 
of the RTase domains were then constructed using the 
maximum likelihood method implemented in MEGA5.0. 

The frequency of RTE insertions was determined experimentally
on ten stickleback populations. The Geographic
Information System coordinates of the populations are pro- 
vided as Supplementary Material online. The fraction of 
fixed and polymorphic (for presence/absence) insertions 
was determined experimentally. DNA was extracted from 
the muscle or fin of either frozen or ethanol-preserved fish.
Tissues were digested with proteinase K followed by a phe- 
nol/chloroform extraction and ethanol precipitation. The 
quality of the DNA extraction was verified by electrophoresis 
on a 1% agarose gel followed by ethidium bromide staining.
The presence or absence of specific nLTR-RT insertions was 
determined using polymerase chain reaction (PCR). Primers 
in the flanking sequence of the insertions were designed 
manually or using the Primer3 program (Rozen and Skaletsky
2000). The specificity of the primers was verified using
the in silico PCR tool from the UCSC web page (www. 
genome.ucsc.edu). For inserts longer than 1.5 kb, a second 
PCR was performed using a primer cognate to the flank and 
an internal primer. PCR products were run on 1% agarose 
gels. The sequence of the primers is provided as Supplemen- 
tary Material online. 

Results 

The stickleback genome contains 11 families of nLTR-RT
belonging to 4 of the 28 clades identified previously
(Kapitonov et al. 2009): the L1/Tx1, L2, Rex/Babar, and RTE
clades (fig. 1). This level of clade diversity is consistent with 
the analysis of Basta et al. (2007) who used a completely dif- 
ferent approach to identify retrotransposons (McClure et al. 
2005). With ~2,396 elements, but only 12 full-length copies,
the most abundant clade, L2, is represented by a single family 
with high similarity to the Maui family previously described in 
Takifugu rubripes (Poulter et al. 1999) (table 1). Notably,
about a third of the elements are shorter than 100 bp, indi- 
cating a high level of fragmentation of these elements. Figure 
2A depicts a phylogenetic tree of Maui elements. This tree has 
the typical cascade structure expected when a single family of 
closely related elements is active in a genome. Elements closer 
to the root represent older copies, whereas clusters of very 
similar sequences indicate recent activity of the family. In fact,
the presence of groups of elements that are identical to each 
other (reflected by the branches of null length) strongly 
suggests that Maui is active in the stickleback. The recent 
activity of Maui is reflected in the relatively low average diver- 
gence of the family (2.2% pairwise divergence; table 1) as 
well as the distribution of pairwise divergence (fig. 3), where 
most values fall under 4% and no values are above 10%. 

The RTE clade is the second most abundant clade of nLTR-RT,
with ~2,253 copies including 28 full-length insertions. It
is represented by the Expander family, which was originally
described from T. rubripes (Kapitonov and Jurka 1999). 
The RepeatMasker output indicates the presence of two 
subsets of Expander: Expander and Expander2. However, 
alignments and phylogenetic analysis of Expander and 
Expander2 reveal that these two putatively different groups 
of RTE are in fact indistinguishable in stickleback and 
correspond to the same family of elements. Thus, they were 
combined in our analysis. The pattern of evolution of 



Expander is similar to Maui as shown on figure 2B. The tree 
strongly indicates that a single family of Expander elements 
has been active in stickleback and probably still is, as 
suggested by the high level of similarity between the most 
recent elements. This recent activity is also reflected in the 
analysis of pairwise divergence between Expander elements 
(fig. 3), which shows a distribution shifted toward low values 
(<5%), suggesting that the vast majority of Expander 
elements have inserted recently in this genome. However, 
we also uncovered a smaller group of elements (14% of 
the total) with much higher divergence (~35% average
pairwise divergence), indicating that a wave of amplification
occurred in the stickleback genome a long time ago
(Expander "old" in table 1).

The Rex/Babar clade is represented in the stickleback by
Rex1, which was originally discovered in Xiphophorus
maculatus (Volff et al. 2000). More than 1,200 Rex1 copies
are found in the stickleback genome. We identified three
well-supported families, which we call Rex1-A, Rex1-B, and
Rex1-C (fig. 4). As only elements at least 300 bp long
can be accurately classified, we estimated the copy number
for each family using a 300-bp cutoff. Rex1-A is the
dominant family with ~570 copies, including four
full-length elements, whereas Rex1-B and Rex1-C are
represented by ~40 and ~130 copies, respectively, and no
full-length copies. Rex1-B and Rex1-C appear to have been unable
to transpose for a long time and are likely to be extinct, as
suggested by their high levels of divergence, 19.6% and
18.5%, respectively. The divergence distribution of Rex1-A
is characterized by a peak at ~4%, suggestive of recent
activity. Yet, the small number of values under 1% suggests
that this family has very low activity in extant stickleback
populations, which is consistent with the very small number
of full-length elements detected (fig. 3).

The most diverse, yet least abundant, clade is L1/Tx1,
represented by six well-supported families (fig. 5A). Families D,
E, and F are clearly monophyletic. They are represented by
highly fragmented elements and are characterized by a high
level of divergence (12.2%-27.4% divergence), suggesting
they have long been extinct. Because elements belonging to
families D, E, and F are extremely fragmented, it is
impossible to determine their copy number accurately. We can
only determine that the stickleback genome does not contain
any full-length element from any of these families. Families
A, B, and C have a more complex history. Families B and C
are reciprocally monophyletic, but depending on the section
of the element used for the phylogenetic reconstruction, the
position of family A varies. The tree based on the 3' end of
the element (fig. 5A) suggests that A is closer to B, but family
A is closer to C on the tree built with the 5' region (fig. 5B).
This suggests that family A resulted from a recombination
event between families B and C. Elements belonging to
families A, B, and C are very similar to each other, resulting in
mean divergences of ~1.0%, 3.0%, and 4.0%, respectively
Fig. 1. Phylogenetic position of the three-spine stickleback elements among the diversity of nLTR-RTs. The stickleback consensus sequences are 
framed in blue. This maximum likelihood tree was constructed from a portion of the translated RTase domain using the rtREV + G + I + F model of 
substitution. The robustness of the nodes was assessed using a bootstrap procedure (500 iterations). 
Table 1

Copy Number and Divergence of Stickleback nLTR-RT

Clade    Family             Copy Number  Copy Number  Full-Length  Average Pairwise Divergence
                            (>100 bp)    (>300 bp)    Copy Number  (± Standard Deviation)
L2       Maui               2,395        1,691        12           2.2 ± 0.4
RTE      Total              2,253        1,070        -            -
         Expander "recent"  -            930          28           4.7 ± 0.5
         Expander "old"     -            140          0            35.6 ± 2.3
Rex1     Total              1,2??        740          -            -
         Rex1-A             -            570          4            3.5 ± 0.4
         Rex1-B             -            40           0            19.6 ± 1.8
         Rex1-C             -            130          0            18.5 ± 1.6
L1/Tx1   Total              405          268          -            -
         A                  -            -            5            1.0 ± 0.2
         B                  -            -            4            3.0 ± 0.4
         C                  -            -            0            4.0 ± 0.5
         D                  -            -            0            20.0 ± 2.0
         E                  -            -            0            12.2 ± 1.5
         F                  -            -            0            27.4 ± 2.4

(fig. 3). These low values indicate that these three closely
related families are still active or have recently been active
in the stickleback. In fact, we identified 5 and 4 full-length
elements in families A and B, respectively, that show a very high
level of similarity, suggesting they could represent active
progenitors.

Although there are some differences of diversity among 
nLTR-RT clades, the vast majority of nLTR-RT insertions tend 
to be recent, with a striking lack of ancient (i.e., divergent) 
elements (fig. 3, bottom panel) and an extreme paucity of 
full-length copies (table 1). There are two nonexclusive ex- 
planations for this observation. First, nLTR-RT insertions 
could fail to accumulate in the stickleback genome due 
to a high rate of turnover in which the insertion of new 
elements is offset by the selective loss of deleterious 
elements. This model is identical to the one proposed for 
the evolution of transposable element copy number in 
Drosophila (Charlesworth B and Charlesworth D 1983; 
Montgomery and Langley 1983; Montgomery et al. 
1987). Second, nLTR-RT could decay rapidly, before or after 
fixation, because of a high rate of DNA loss. To determine if 
nLTR-RT insertions do reach fixation, we experimentally as- 
sessed the polymorphism of 50 Expander insertions repre- 
senting a wide range of divergence in 16 individuals from 
Bear Paw Lake, the population from which fish used for 
the genome project came (table 2). The presence/absence 
of inserts was determined by PCR using primers located 
in the flank of the elements and/or a primer cognate to 
the flank and a primer internal to the element (for long 
inserts). We found that in this population, all insertions 
diverging from their consensus by more than 3% are fixed. 
Although the fraction of elements that are fixed is propor- 
tionally lower in elements that have a low divergence from 
the family consensus, a significant proportion of those low 
divergence elements are also fixed. For instance, out of eight 



elements with divergence between 1% and 2%, six are 
fixed. To estimate the number of fixed Expander elements 
in the stickleback genome, we drew the curve of divergence 
from consensus for all ~1,070 Expander elements (fig. 6,
top panel). We then extrapolated the fraction of fixed 
elements in each divergence category to the entire Expander 
family. Using this approach, we estimated that 710 
Expander elements (i.e., 66% of the insertions) are fixed 
in stickleback. Assuming that all nLTR-RT evolve at the same 
rate, we determined that 72.3% of all nLTR-RT insertions are 
fixed in the Bear Paw Lake population, which corresponds to 
2,725 copies out of 3,769. Although this is a rough 
estimate, a large majority of nLTR-RT is undoubtedly fixed 
in this population. 
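The extrapolation described above can be sketched as a weighted sum: the fraction of PCR-screened loci that are fixed in each divergence category is applied to the number of elements in that category genome-wide. In the example below, the per-category counts and most of the fixed fractions are hypothetical values chosen only so that the counts sum to ~1,070; the 6-of-8 value for the 1%-2% category and the 100% value above 3% are the only inputs taken from the text. The function name is likewise an assumption.

```python
def estimate_fixed_copies(bin_counts, fixed_fraction):
    """Scale the PCR-derived fixed fraction of each divergence bin to the whole family."""
    return sum(n * fixed_fraction[b] for b, n in bin_counts.items())


# Hypothetical genome-wide counts per divergence bin (% divergence from consensus).
bin_counts = {"0-1": 300, "1-2": 200, "2-3": 170, ">3": 400}

# Fraction of screened loci fixed in each bin. Only the 1-2% value (6 of 8)
# and the >3% value (all fixed) come from the text; the others are made up.
fixed_fraction = {"0-1": 0.40, "1-2": 6 / 8, "2-3": 0.90, ">3": 1.00}

estimated_fixed = estimate_fixed_copies(bin_counts, fixed_fraction)

# Consistency check on the proportions reported in the text.
expander_pct = round(100 * 710 / 1070)           # 710 of ~1,070 Expander copies
all_families_pct = round(100 * 2725 / 3769, 1)   # 2,725 of 3,769 nLTR-RT copies
```

The reported totals are internally consistent: 710 of ~1,070 Expander copies is 66%, and 2,725 of 3,769 copies is 72.3%.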

It is plausible, however, that the large number of fixed 
insertions in the Bear Paw Lake population results from 
the demographic history of this population. The Bear Paw 
Lake population is characterized by a lower level of genetic 
variation than marine and stream populations, suggesting it 
has a lower effective population size (Aguirre 2007). Smaller 
population size decreases the efficiency of purifying selec- 
tion, allowing the fixation of insertions that otherwise would 
have been eliminated in a population with a large effective 
size. To test this hypothesis, we estimated the frequency of 
the same Expander insertions in nine other populations 
including lake, stream, and oceanic populations (see Supple- 
mentary Material online). Of particular interest is a compar- 
ison with the anadromous (sea-run) Rabbit Slough 
population (N = 43), which has apparently not suffered
any reduction in population size (table 2). This population 
exhibits a level of genetic variation (based on microsatellite 
variation) similar to the one reported in other marine spe- 
cies, which is consistent with a large effective population 
size (Aguirre 2007). We also found that a majority of inser- 
tions are fixed in this population, and using the same 
Fig. 2. Phylogenetic relationships among Maui (A) and Expander (B) elements from the three-spine stickleback genome. The trees were
constructed with the maximum likelihood method using the K2 + G model. Only bootstrap (1,000 iterations) values >80% are shown.



calculation as above, we estimated that ~670 Expander
insertions are fixed (fig. 6, bottom panel), which is very close
to the estimate obtained for Bear Paw Lake (710 fixed inser- 
tions). We extrapolated these calculations to all nLTR-RT 
families, and we estimated that 73.3% (i.e., 2,765 copies 
out of 3,769) of the elements are fixed, a result remarkably 
close to the estimate for the Bear Paw Lake population. 
Similar calculations performed on the other populations 
provided consistent estimates, suggesting that most inser- 
tions reached fixation before these different populations 
separated. 

These estimates strongly indicate that nLTR-RTs accumu- 
late readily in the stickleback genome; yet, they do not imply 
that insertions are fully neutral in this species. Although the 
number of insertions we screened here is too small to esti- 
mate accurately selection coefficients, our data suggest that 
some insertions are indeed likely to be deleterious. Figure 7 
shows the proportion of fixed and truncated insertions 
relative to the length of the elements. To avoid the con- 
founding effect of demography, this figure was estimated 
using only the Rabbit Slough data. The vast majority 
(~85%) of fixed insertions is severely truncated (<1 kb);
fixed long (>1 kb) insertions are rare, and we failed to find 
a single fixed full-length insertion. Full-length and truncated 
insertions are produced by target-primed reverse transcrip- 
tion and truncations of the 5' end occur at the time of 
insertion. Thus, the deficiency in fixed full-length elements 
is likely due to a post-insertional process. Although the 
full-length elements could be rapidly lost because of a high 



rate of DNA deletion (see below), it is also possible that the 
lack of fixed full-length elements reflects the differential
fixation of elements of different lengths. This would imply 
that purifying selection is acting on long elements, thus pre- 
venting their fixation, and suggests that Expander elements 
could be imposing a fitness cost related to the insertion 
length on their host. It remains true, however, that purifying 
selection is insufficient to prevent the fixation of truncated 
elements, which constitute the majority of the inserts. 

We then examined the second explanation for the low 
copy number and the low divergence of nLTR-RT, namely, 
the DNA loss hypothesis. DNA can be lost in two ways, either 
by the accumulation of small (<50 bp) internal deletions or 
by deletions of large segments of sequence. We first exam- 
ined the occurrence of short deletions in elements belong- 
ing to the Maui and Expander families. For comparison, we 
collected ~120 L1 elements from the human genome
representing a similar range of divergence to the stickleback
elements. Figure 8A shows the number of small deletions 
per kilo base pairs relative to the age of elements. Small 
deletions occur readily in stickleback, at a rate of about 1 
deletion/kb per unit of divergence. This rate of deletion is 
about three times higher than the rate in humans (~0.3
deletion/kb per unit of divergence), suggesting that nLTR-RT
sequences are much less stable in fish than in humans. How- 
ever, the accumulation of small deletions is insufficient to 
account for the extreme scarcity of elements with diver- 
gence higher than 10%. The fraction of elements deleted 
through the accumulation of small deletions is ~0.6%
Fig. 3. Pairwise divergence of families belonging to the four clades recovered from the three-spine stickleback genome, Maui, Expander, Rex1-A,
and Tx1-A, and combined for all families.



per unit of divergence, meaning that an element with 
a 10% divergence from consensus will have, on average, 
lost only 6% of its length (fig. 8B). Although this value is 
four times higher than the rate of deletion in humans, it 
is clearly insufficient to explain the lack of ancient elements 
in the stickleback genome. 
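The arithmetic behind this conclusion can be made explicit with a short sketch. It assumes that the length lost to small deletions accumulates linearly with divergence, which is a simplification introduced here; the human rate is not stated directly in the text and is derived from the "four times higher" comparison.

```python
def percent_length_lost(divergence_pct, loss_rate_pct_per_unit=0.6):
    """Percent of an element's length removed by small deletions, assuming linear accumulation."""
    return min(100.0, loss_rate_pct_per_unit * divergence_pct)


# Stickleback: ~0.6% of the length lost per unit (1%) of divergence.
stickleback_at_10 = percent_length_lost(10)

# Human: rate derived from the statement that the stickleback value is four times higher.
human_at_10 = percent_length_lost(10, 0.6 / 4)
```

Under this assumption, an element that has diverged by 10% from its consensus retains about 94% of its original length in stickleback, so small deletions alone would leave most ancient elements easily detectable.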

We then examined the impact of large deletions on the de- 
cay of nLTR-RT sequences. Large deletions will produce highly 
fragmented elements, particularly elements that will lack one 
or both of their termini. The difficulty in assessing the occur- 
rence of large deletions in nLTR-RT results from the diversity of 
structure that can be generated at the time of insertion. In 
particular, a majority of nLTR-RT insertions are truncated in 
5' at the time of insertion, possibly because of premature base 
pairing with the target site (Martin et al. 2005). Thus, when an 
element is missing its 5' end, it is nearly impossible to deter- 
mine if this is the result of a truncation at the time of insertion 



or of a large deletion. Conversely, the loss of the 3' extremity
can only be caused by a DNA deletion. We collected 683 intact 
Expander elements, and for each of them, we scored the be- 
ginning and end of the sequence relative to the full-length 
consensus of the family. Elements interrupted by gaps in
the draft sequence were eliminated. These elements are 
presented on the top panel of figure 9. We first verified that 
elements missing their 3' ends are on average more divergent 
than those with intact 3' ends, which is expected if 3' termini 
are lost post-insertionally and not at the time of insertion. As 
predicted, we found that elements missing their 3' ends are 
more divergent (4.73%) than elements with an intact 3' end 
(2.60%). Figure 9 shows that a large number of elements 
(51.5% of the total) are missing their 3' end and that most 
of them (46.9%) are missing both their 5' and 3' ends. 
The remaining 48.5% can be considered to be intact and have 
presumably not suffered a deletion. Of those, 4% are full 
Fig. 4. Phylogenetic relationships among Rex1 elements from the three-spine stickleback genome. The tree was constructed with the maximum
likelihood method using the K2 + G model. The three Rex1 families are indicated in brackets. Only bootstrap (1,000 iterations) values >80% are
shown.



length and 44.5% are truncated in 5'. Assuming conserva- 
tively that all missing 5' termini were due to truncation and 
that missing 3' ends were caused by post-insertional deletions,
we estimated that at least 37% of the DNA generated by the 
Expander family has been lost by large deletions. This is cer- 
tainly an underestimate as a number of missing 5' ends prob- 
ably resulted from deletion and not truncation. This rate of 
DNA loss was unexpected, considering the age distribution 
of Expander inserts (fig. 3), but it is consistent with the large 
fraction of elements shorter than 300 bp (table 1). For com- 



parison, we performed the same analysis in human sequences 
using 584 L1 elements with a range of divergence similar to
that of Expander. We found that a tiny fraction of L1
elements (<1%) are missing their 3' end and that the vast
majority of elements are structurally intact. This difference in
fragmentation between fish and human nLTR-RT is even more
striking when one considers that a full-length L1 is almost
twice as long as a full-length Expander and thus should be 
more likely to experience deletions. This analysis demonstrated 
that large deletions occur much more often in stickleback than 
Fig. 5. Phylogenetic relationships among L1/Tx1 elements using sequences from the 3' terminus (A) and the 5' end (B) of the elements. The
trees were constructed with the maximum likelihood method using the K2 + G model. Only bootstrap (1,000 iterations) values >80% are shown.



in humans and are sufficiently common to account for the 
extreme scarcity of ancient elements in the stickleback 
genome. 
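The scoring of element termini and the conservative estimate of DNA lost to large deletions can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: each element is described by its start and end positions on the full-length consensus (1-based), a missing 5' end is attributed entirely to truncation at insertion, and only the missing 3' portion is counted as deleted. The function names, the 3,000-bp consensus length, and the four example elements are hypothetical.

```python
def classify(start, end, consensus_len, tol=0):
    """Classify an element by which consensus termini it retains."""
    has_5p = start <= 1 + tol
    has_3p = end >= consensus_len - tol
    if has_5p and has_3p:
        return "full_length"
    if has_3p:
        return "5p_truncated"       # 3' end intact; 5' loss attributed to insertion
    if has_5p:
        return "3p_deleted"         # 3' end missing; must be post-insertional
    return "both_ends_missing"


def large_deletion_loss(elements, consensus_len):
    """Conservative fraction of inserted DNA later removed by large 3' deletions."""
    generated = lost = 0
    for start, end in elements:
        generated += consensus_len - start + 1  # length assumed present at insertion
        lost += consensus_len - end              # 3' bases missing today
    return lost / generated


# Hypothetical elements as (start, end) on a 3,000-bp consensus.
elements = [(1, 3000), (2001, 3000), (1001, 2000), (1, 1500)]
classes = [classify(s, e, 3000) for s, e in elements]
loss = large_deletion_loss(elements, 3000)
```

Because deletions that remove a 5' end are indistinguishable from truncations and are therefore ignored, a calculation of this kind yields a lower bound on the amount of DNA lost.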

Discussion 

The stickleback genome contains four active clades of nLTR- 
RTs, some of which are represented by multiple families of 
elements. There are, however, some interesting differences 
among nLTR-RT clades: the RTE and L2 clades are each
represented by a single family, but there are three Rex1 and six
L1/Tx1 families. How does this level of diversity compare
with that of other nonmammalian vertebrates? A previous 
study showed that the stickleback has reduced clade diver- 
sity compared with other teleosteans (Basta et al. 2007). 
Here we showed that this low level of diversity is also found 
at the family level. With six families, including only three
active ones, the L1/Tx1 clade in stickleback is considerably less
diverse than the L1/Tx1 clade in killifish (Duvernell et al.
2004), zebra fish, which harbor at least 32 distinct families 
(Furano et al. 2004), or in the lizard Anolis carolinensis 
(Novick et al. 2009). Similarly, the L2 clade is represented 
by the sole Maui family, whereas the zebra fish genome 
contains more than 40 L2-related families (based on the
annotations of the zebra fish genome at http://genome.ucsc.edu)
and the lizard has 17 families (Novick et al. 2009). The
low level of diversity of Rex1 and RTE, on the other hand, is
similar to that reported in other taxa, as these two clades do
not seem to diversify to the same extent as the L1 or L2 clades
(Kordis and Gubensek 1998; Volff et al. 2000; Zupunski 
et al. 2001). 

The relatively low copy number and the very recent age 
of nLTR-RT elements in stickleback are reminiscent of the 
situation in the other teleostean genomes examined so 
far (Volff et al. 2003; Duvernell et al. 2004; Furano et al. 
2004; Neafsey et al. 2004). Because of the similarities with 
Drosophila, it was originally proposed that nLTR-RT elements
in teleosteans are subjected to a high rate of turnover in 
which the insertion of new elements is offset by the selective 
loss of insertions (Duvernell et al. 2004; Furano et al. 2004). 
This model predicts that most elements are deleterious and 
segregate at low frequency in populations. However, the 
high number of fixed insertions in stickleback is not consis- 
tent with the turnover model as it applies to Drosophila. 
There are two nonexclusive explanations for the accumula- 
tion of nLTR-RT insertions in stickleback. First, it is possible 
that most nLTR-RT insertions have no impact on host fitness. 
This hypothesis is consistent with the population genetic 
Table 2

Frequency of Insertions Tested by PCR in the Bear Paw Lake and Rabbit Slough Populations

Locus   Coordinates of Locus        Length of        Divergence from  Bear Paw  Rabbit
Number                              Insertion (bp)   Consensus (%)    Lake      Slough
Loc1    chrIX:20177336-20177676     340              0.00             0.80      0.95
Loc2    chrVII:228754-231552        2,798            0.00             0.10      0.00
Loc3    chrIV:20985857-20989180     3,323            0.00             1.00      0.00
Loc4    chrIII:15771865-15775195    3,330            0.00             0.00      0.37
Loc5    chrVII:12188102-12191449    3,347            0.00             1.00      0.18
Loc6    chrIV:24308317-24311680     3,363            0.00             0.00      0.00
Loc7    chrVII:15556265-15559605    3,340            0.34             0.33      0.00
Loc8    chrXV:2980897-2981317       420              0.34             0.70      0.82
Loc9    chrIII:6449280-6449749      469              0.34             1.00      1.00
Loc10   chrIV:23261707-23264743     3,036            0.34             0.20      0.00
Loc11   chrII:725502-725889         387              0.34             1.00      1.00
Loc12   chrXX:131232-131647         415              0.35             0.00      0.04
Loc13   chrVII:13330653-13331112    459              0.68             1.00      0.00
Loc14   chrV:7725501-7725980        479              0.68             0.67      1.00
Loc15   chrI:11778807-11779368      561              0.68             1.00      1.00
Loc16   chrI:16849656-16850343      687              0.68             1.00      1.00
Loc17   chrI:12839403-12842746      3,343            0.69             0.00      0.51
Loc18   chrI:21573012-21574283      1,271            0.69             1.00      0.00
Loc19   chrIV:27086216-27089151     2,935            0.70             1.00      0.50
Loc20   chrIV:26298715-26299022     307              0.78             1.00      NA
Loc21   chrVII:11214917-11217225    2,308            1.02             1.00      1.00
Loc22   chrVI:17025662-17026098     436              1.03             0.70      1.00
Loc23   chrVIII:7969513-7969829     316              1.03             1.00      0.00
Loc24   chrIV:25767136-25770441     3,305            1.03             0.43      0.45
Loc25   chrXIII:15925609-15926017   408              1.05             1.00      1.00
Loc26   chrIV:23957871-23958211     340              1.37             1.00      1.00
Loc27   chrIX:2238963-2239452       489              1.40             1.00      NA

Loc28 


chrll: 11 34048-11 34688 


640 


1.81 


1.00 


1.00 


Loc29 


chrl:19360848-19361301 


4S3 


2.44 


0.96 


1.00 


Loc30 


chrll:2180624B-21807448 


1,203 


2.46 


0.00 


1.00 


Loc31 


chrVII:9967491-9968887 


1,396 


2.B1 


0.00 


1.00 


Loc32 


chrXIV;1 0308068-1 0308391 


323 


3.16 


1.00 


1.00 


Loc33 


chrl;200091B0-20009606 


456 


3.37 


1.00 


0.00 


Loc34 


chrl:17B7028S-17S70923 


638 


3.B1 


1.00 


1.00 


Loc35 


chrl:8606080-8606BB2 


472 


3.B3 


1.00 


1.00 


Loc36 


chrlV:99671-100126 


455 


3.B4 


1.00 


1.00 


Loc37 


chrXVI:B4213B7-B421768 


411 


4.22 


1.00 


1.00 


Loc38 


chrV:B918311-B918748 


437 


4.68 


1.00 


1.00 


Loc39 


chrVII: 11 64081 7-1 1642283 


1,466 


B.B2 


1.00 


1.00 


Loc40 


chrXIV:1 4881 766-148821 1 1 


34B 


7.10 


1 .00 


1 .00 


Loc41 


chrlll:620B638-6208177 


2,539 


7.20 


1.00 


1.00 


Loc42 


chrlV:2623936B-26239741 


376 


9.17 


1.00 


1.00 


Loc43 


chrll:9308974-9309348 


374 


14.12 


1.00 


1.00 


Loc44 


chrll:20B7B93B-20B76B07 


572 


20.69 


1.00 


1.00 


Loc45 


chrlll:4938664-4939000 


336 


21.2B 


1.00 


1.00 


Loc46 


chrl:901 1472-901 1777 


305 


24.21 


1.00 


1.00 


Loc47 


chrll:9734B44-9734844 


300 


2B.17 


1.00 


1.00 


Loc48 


chrVII:12290981-12291438 


457 


26.10 


1.00 


1.00 


Loc49 


chrVI:B41B002-B41B472 


470 


3B.41 


1.00 


1.00 


Loc50 


chrlll:6418419-6419108 


689 


3B.70 


1.00 


1.00 



Note. — NA, no amplification. 



analysis of the spotted puffer fish (Tetraodon nigroviridis)
performed by Neafsey et al. (2004), who found that most 
elements segregated at high frequency or were fixed in this 
species and behaved as neutral alleles. It is notable that in 



stickleback, the vast majority of fixed insertions are trun- 
cated, suggesting that truncated insertions could be neutral. 
Similarly, in Drosophila and humans, purifying selection acts 
preferentially against long elements, and severely truncated 
Fig. 6. — Fraction of fixed and polymorphic Expander elements extrapolated from population data. The analysis was performed separately for the
Bear Paw Lake (A) and the anadromous Rabbit Slough (B) populations. Polymorphic elements were split into elements found at frequencies higher and
lower than 50%.



elements behave as neutral or nearly neutral alleles (Petrov 
et al. 2003; Boissinot et al. 2006). 
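
The contrast between short and long insertions can be checked directly against Table 2. The sketch below is a minimal Python tally, not part of the original analysis: it uses the insertion lengths and Rabbit Slough frequencies transcribed from Table 2, treats a frequency of 1.00 as fixed, and applies a 3,000-bp cutoff for near full-length insertions. The cutoff and the names `LOCI`, `LONG_CUTOFF_BP`, and `tally` are assumptions made for illustration.

```python
# (length in bp, Rabbit Slough frequency) for Loc1..Loc50, transcribed from Table 2.
# None marks the two loci with no amplification (NA).
LOCI = [
    (340, 0.95), (2798, 0.00), (3323, 0.00), (3330, 0.37), (3347, 0.18),
    (3363, 0.00), (3340, 0.00), (420, 0.82), (469, 1.00), (3036, 0.00),
    (387, 1.00), (415, 0.04), (459, 0.00), (479, 1.00), (561, 1.00),
    (687, 1.00), (3343, 0.51), (1271, 0.00), (2935, 0.50), (307, None),
    (2308, 1.00), (436, 1.00), (316, 0.00), (3305, 0.45), (408, 1.00),
    (340, 1.00), (489, None), (640, 1.00), (453, 1.00), (1203, 1.00),
    (1396, 1.00), (323, 1.00), (456, 0.00), (638, 1.00), (472, 1.00),
    (455, 1.00), (411, 1.00), (437, 1.00), (1466, 1.00), (345, 1.00),
    (2539, 1.00), (376, 1.00), (374, 1.00), (572, 1.00), (336, 1.00),
    (305, 1.00), (300, 1.00), (457, 1.00), (470, 1.00), (689, 1.00),
]

LONG_CUTOFF_BP = 3000  # assumed threshold for "near full-length"; not from the paper


def tally(loci, cutoff=LONG_CUTOFF_BP):
    """Count screened and fixed insertions in each length class."""
    counts = {"short": {"screened": 0, "fixed": 0},
              "long": {"screened": 0, "fixed": 0}}
    for length, freq in loci:
        if freq is None:  # skip loci that failed to amplify
            continue
        size_class = "long" if length >= cutoff else "short"
        counts[size_class]["screened"] += 1
        if freq == 1.0:
            counts[size_class]["fixed"] += 1
    return counts


if __name__ == "__main__":
    for size_class, c in tally(LOCI).items():
        print(f"{size_class}: {c['fixed']} fixed of {c['screened']} screened")
```

With these inputs, the tally gives 31 fixed insertions out of 40 below the cutoff and none out of 8 at or above it. The total of 48 screened loci matches the sample size given for Fig. 7.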

In contrast, the number of full-length elements is 
extremely small in stickleback for all nLTR-RT families, and 
we failed to find a single fixed full-length insertion. The 
number of full-length insertions found in other teleostean 
genomes is also extremely small, suggesting that a common 
mechanism might limit fixation of full-length insertions in all 
teleosteans (Basta et al. 2007). It is possible that the rate of 







Fig. 7. — Fraction of polymorphic and fixed Expander elements 
relative to their length. The distribution is based on 48 insertions 
screened in the Rabbit Slough population. 



DNA loss in stickleback (see below) is sufficiently high to elim- 
inate full-length elements soon after or even before they 
reach fixation. However, the general scarcity of full-length 
elements and the apparent absence of fixed full-length inser- 
tions could also be interpreted as evidence for a strongly 
deleterious effect of these elements, which would prevent 
their fixation. Thus, the turnover model might apply in tele- 
osts but only to full-length elements. A deleterious impact of 
such long elements was not detected in the T. nigroviridis 
study, possibly because only severely truncated elements
were examined in that study (Neafsey et al. 2004). A deleterious
effect of nLTR-RT related to the length of the elements
has previously been described in Drosophila and in humans 
(Boissinot et al. 2001, 2006; Petrov et al. 2003, 2011) and 
results from the greater ability of long elements to mediate 
ectopic recombination events, which are extremely deleterious
(Langley et al. 1988; Song and Boissinot 2007). Although
our results in stickleback are consistent with the ectopic 
recombination model, it is possible that selection acts specif- 
ically against full-length elements because of a deleterious 
effect related to the transcription or translation of these 
elements (Nuzhdin et al. 1996; Brookfield and Badge 
Fig. 8. — (A) Relationship between the number of small deletions and the divergence from consensus for stickleback Expander (y = 1.037x; R² =
0.3581) and human L1 elements (y = 0.3584x; R² = 0.4446). (B) Relationship between the fraction of element lost through small deletions and the
divergence from consensus for stickleback Expander (y = 0.6202x; R² = 0.2253) and human L1 elements (y = 0.145x; R² = 0.282).



1997). Whatever the exact mechanism, it is clear that the 
number of full-length elements in fish genomes is strictly lim- 
ited. As full-length elements are the only elements capable of 
transposition, selection limiting the spread of full-length cop- 
ies could reduce the transposition rate and the number of 
new nLTR-RT copies, contributing to the low copy number 
of most families. This could, in part, explain the much greater 
copy number in mammals than in teleosts. Eutherian genomes
harbor a much larger number of active copies than fish
genomes. For instance, the number of full-length L1 active or



potentially active copies in human and mouse is 80-100 and 
2,000-3,000, respectively (Brouha et al. 2003; Akagi et al. 
2008). Thus, the strength of selection against full-length cop- 
ies in mammals, although significant, does not prevent the 
fixation of a large number of full-length copies, which in turn
could lead to a greater transposition rate and larger families in
mammals than in fish.

The high fraction of fixed insertions in stickleback could 
also result from the demographic history of the species. As 
nLTR-RTs are obligatory parasites, their dynamics in the 
genome is affected by the evolution and natural history of
their host. Thus, any factor that affects the effective population
size (Ne) of the host will modify the equilibrium between
drift and selection. When Ne is large, as in
Drosophila, selection dominates over drift, but any factor
that decreases Ne (e.g., bottleneck, mating system) will
strengthen drift. In populations with a small Ne, purifying
selection against deleterious insertions does not act as efficiently
as in large populations. Thus, we expect a higher rate
of fixation in populations that went through a bottleneck or
a founder effect, as was observed in populations of the plant
Arabidopsis lyrata and in Drosophila subobscura (Garcia
Guerreiro et al. 2008; Lockton et al. 2008). A number of recent
studies have examined the amount of genetic variation
in three-spine stickleback (Hohenlohe et al. 2010; Deagle
et al. 2011; Jones et al. 2012). Three-spine stickleback populations
are genetically very diverse, and there is no evidence
for a reduced effective population size at the level of the species
that could have favored the fixation of a large number of 
nLTR-RT. Thus, it is very unlikely that the large proportion 
of fixed insertions in this genome could be due to a reduction 
in population size. 
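
The dependence of fixation on Ne can be made concrete with Kimura's diffusion approximation for the fixation probability of a new mutation with additive effects. This is a standard population genetic result, not a calculation from this study. The selection coefficient and the two population sizes below are arbitrary illustrative values, and the sketch assumes that census and effective sizes are equal.

```python
import math


def fixation_probability(s, ne):
    """Kimura's diffusion approximation for a new additive mutation.

    s  : selection coefficient (negative for a deleterious insertion)
    ne : effective population size, assumed equal to the census size
    """
    if s == 0:
        return 1.0 / (2 * ne)  # neutral expectation
    # expm1 keeps precision when s is close to zero;
    # very large values of 4 * ne * |s| can overflow for negative s
    return math.expm1(-2 * s) / math.expm1(-4 * ne * s)


def relative_to_neutral(s, ne):
    """Fixation probability divided by the neutral value 1/(2*ne)."""
    return fixation_probability(s, ne) * 2 * ne


if __name__ == "__main__":
    s = -1e-4  # hypothetical weakly deleterious insertion
    for ne in (1_000, 1_000_000):
        print(ne, relative_to_neutral(s, ne))
```

Under these assumed values, the insertion fixes at roughly 80% of the neutral rate when Ne = 1,000 but essentially never when Ne = 1,000,000, which is the sense in which drift dominates in small populations.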

Whatever the cause, it remains that a very large number 
of elements reached fixation in three-spine stickleback, and 
it is likely that it has been the case for a long time. Thus, the 
relatively young age of nLTR-RT families and the extreme 
rarity of ancient elements imply that a second mechanism, 
DNA loss, has played a significant role in limiting nLTR-RT 
copy number. Accumulation of small deletions cannot 
account for the rapid decay of insertions, but large deletions 
were frequent enough to rapidly eliminate a large fraction of 
the DNA sequence generated by nLTR-RT activity. The loss of
long fragments by large-scale deletion had previously been 
reported in a lizard (Novick et al. 2009) and is apparently the 
major cause of genome shrinkage in plants (Devos et al. 
2002; Ma et al. 2004; Hawkins et al. 2009). The high rate 
of DNA loss by large deletions reported in these taxa is 
certainly sufficient to counteract the amplification of trans- 
posable elements and to limit genome size expansion. In 
contrast, large deletions seem to occur very rarely in mam- 
mals, and this could contribute to the extremely large size of 
mammalian genomes. 
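
The difference between the two species in Fig. 8 can be summarized by the ratio of the fitted slopes. The short sketch below only restates the regression coefficients from the figure legend; the dictionary layout and the function name `slope_ratio` are illustrative choices.

```python
# Slopes of the fitted lines (no intercept), as given in the legend of Fig. 8.
SLOPES = {
    "deletion_count": {"stickleback_Expander": 1.037, "human_L1": 0.3584},
    "fraction_lost": {"stickleback_Expander": 0.6202, "human_L1": 0.145},
}


def slope_ratio(panel):
    """Ratio of the stickleback slope to the human slope for one panel."""
    s = SLOPES[panel]
    return s["stickleback_Expander"] / s["human_L1"]


if __name__ == "__main__":
    for panel in SLOPES:
        print(f"{panel}: {slope_ratio(panel):.2f}")
```

According to these fits, for a given level of divergence small deletions are about 2.9 times as numerous and remove about 4.3 times as much sequence in stickleback Expander as in human L1.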

This analysis of nLTR-RT decay in stickleback sheds new 
light on the controversial question of genome size evolu- 
tion. In a landmark paper, Petrov (2002) proposed that the 
genome size reflects an equilibrium between large inser- 
tions that increase genome size and accumulation of small 
deletions that decrease it. This model was based on the 
observation that small deletions occur more frequently 
in insect species with small genomes than in species with 
large genomes (Petrov and Hartl 1997; Petrov et al. 2000;
Bensasson et al. 2001). Petrov's model has been controver- 
sial because even in species where small deletions occur 
frequently this process appears to be too slow to account 



for the small size of these genomes (Gregory 2003, 2004). 
In the original description of the model, large deletions were 
discounted as a significant source of DNA loss because they 
should be very deleterious, particularly in compact genomes 
such as the Drosophila genome. However, it seems that in 
plants and in nonmammalian vertebrates, large deletions 
do occur readily and, based on their frequency, are unlikely 
to be very deleterious. It is indeed surprising that large dele- 
tions are tolerated in these organisms because they could 
affect regulatory or protein coding regions. It is, however, 
possible that these deletions preferentially target repetitive 
DNA and that coding regions are protected from them. 
Clearly more work on the mechanisms and distribution of 
large deletions in vertebrates is required. It should be noted 
that the occurrence of large deletions in other groups, such as 
insects, has yet to be examined in detail. Early studies relied on 
the amplification and cloning of transposable element insertions
or pseudogenes to infer the indel spectrum and consequently
could not capture large deletions (Petrov et al. 2000;
Bensasson et al. 2001). In conclusion, our analysis does not 
contradict the general idea behind the mutational equilibrium 
model, but we suggest that large deletions certainly play a far 
greater role in the process of DNA loss than originally thought, 
at least in teleostean fish. 
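
The idea of a mutational equilibrium can be illustrated with a deliberately simple model that is not taken from Petrov (2002) or from this study. DNA is added at a constant rate and every base pair is lost at a constant rate per step, so that genome size G changes by `gain - loss_rate * G` at each step and settles at `gain / loss_rate`. All parameter values are arbitrary, and the function names are hypothetical.

```python
def equilibrium_size(gain, loss_rate):
    """Analytic equilibrium of the update G -> G + gain - loss_rate * G."""
    return gain / loss_rate


def simulate(g0, gain, loss_rate, steps):
    """Iterate the toy model and return the final genome size."""
    g = g0
    for _ in range(steps):
        g += gain - loss_rate * g
    return g


if __name__ == "__main__":
    gain = 100.0  # bp inserted per step (hypothetical)
    for loss_rate in (1e-7, 1e-6):  # fraction of DNA lost per step (hypothetical)
        print(loss_rate, equilibrium_size(gain, loss_rate))
```

In this toy setting, a tenfold higher loss rate gives a tenfold smaller equilibrium size for the same rate of insertion, whether the loss comes from many small deletions or from fewer large ones.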

Supplementary Material 

Supplementary materials are available at Genome Biology 
and Evolution online (http://www.gbe.oxfordjournals.org/). 